This paper proposes a new method, OFA-OCR, to transfer multimodal pretrained models to text recognition. Specifically, we recast text recognition as image captioning and directly transfer a unified vision-language pretrained model to the end task. Without pretraining on large-scale annotated or synthetic text recognition data, OFA-OCR outperforms the baselines and achieves state-of-the-art performance on the Chinese text recognition benchmark. Additionally, we construct an OCR pipeline with OFA-OCR and demonstrate that it achieves performance competitive with a product-level API. The code (https://github.com/OFA-Sys/OFA) and demo (https://modelscope.cn/studios/damo/ofa_ocr_pipeline/summary) are publicly available.
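The key idea is that a cropped text-line image is fed to an encoder-decoder vision-language model that simply "captions" it with the transcribed text. A minimal sketch of how this could be invoked through the ModelScope pipeline linked in the demo above is shown below; the exact model id and the output schema are assumptions and may differ from the released checkpoints.

```python
# Hedged sketch: treating OCR as image captioning with a unified VL model.
# The model id below is an assumption; check the ModelScope demo page for the
# actual released checkpoints.
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

ocr = pipeline(
    Tasks.ocr_recognition,                            # recognition, not detection
    model="damo/ofa_ocr-recognition_scene_base_zh",   # assumed model id
)

result = ocr("cropped_text_line.jpg")   # a cropped text-line image from a detector
print(result)                           # e.g. {"text": "..."}; schema may vary
```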
Generalist models, which are capable of performing diverse multi-modal tasks in a task-agnostic way within a single model, have been explored recently. As a hopeful alternative route toward general-purpose AI, existing generalist models are still at an early stage, where modality and task coverage is limited. To empower multi-modal task-scaling and speed up this line of research, we release a generalist model learning system, OFASys, built on top of a declarative task interface named multi-modal instruction. At the core of OFASys is the idea of decoupling multi-modal task representations from the underlying model implementations. In OFASys, a task involving multiple modalities can be defined declaratively, even with just a single line of code. The system automatically generates task plans from such instructions for training and inference. It also facilitates multi-task training for diverse multi-modal workloads. As a starting point, we provide presets of 7 different modalities and 23 highly diverse example tasks in OFASys, with which we also develop a first-of-its-kind single model, OFA+, that can handle text, image, speech, video, and motion data. The single OFA+ model achieves 95% of the performance on average with only 16% of the parameters of 15 task-finetuned models, showcasing the performance reliability of multi-modal task-scaling provided by OFASys. Available at https://github.com/OFA-Sys/OFASys.
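To make the "single line of code" claim concrete, the sketch below shows what such a multi-modal instruction could look like. The slot syntax is an approximation of the format described in the paper, not a verified OFASys API; consult the repository for the exact grammar.

```python
# Hedged sketch of a declarative multi-modal instruction: input slots on the
# left of "->", output slots on the right, each tagged with its modality.
image_caption_task = "[IMAGE:img] what does the image describe? -> [TEXT:cap]"

# Swapping the modality slots is, conceptually, all that changes for a new task:
asr_task = "[AUDIO:wav] what is said in this audio clip? -> [TEXT:transcript]"

# The system is described as parsing such instructions into slots, routing each
# slot to a modality-specific pre/post-processor, and sharing one generalist
# model (OFA+) across all resulting tasks.
```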
Contrastive Language-Image Pre-training (CLIP) has demonstrated great potential in realizing open-vocabulary visual recognition in a matching style, due to its holistic use of natural language supervision that covers unconstrained real-world visual concepts. However, it is, in turn, also difficult to evaluate and analyze the openness of CLIP-like models, since they are in theory open to any vocabulary but the actual accuracy varies. To address the insufficiency of conventional studies on openness, we resort to an incremental perspective and define the extensibility, which essentially approximates the model's ability to deal with new visual concepts, by evaluating openness through vocabulary expansions. Our evaluation based on extensibility shows that CLIP-like models are hardly truly open and their performance degrades as the vocabulary expands to different degrees. Further analysis reveals that the over-estimation of openness is not because CLIP-like models fail to capture the general similarity of image and text features of novel visual concepts, but because of the confusion among competing text features, that is, they are not stable with respect to the vocabulary. In light of this, we propose to improve the openness of CLIP in the feature space by enforcing the distinguishability of text features. Our method retrieves relevant texts from the pre-training corpus to enhance prompts for inference, which boosts the extensibility and stability of CLIP even without fine-tuning.
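The vocabulary-expansion protocol can be illustrated with standard CLIP zero-shot classification: classify the same image against a base vocabulary and an expanded one, and check whether the prediction stays stable. The sketch below shows only this protocol; the paper's extensibility metric and retrieval-enhanced prompts are not reproduced, and the image path and class lists are placeholders.

```python
# Minimal sketch of evaluating openness via vocabulary expansion with CLIP.
import torch
import clip
from PIL import Image

model, preprocess = clip.load("ViT-B/32", device="cpu")
image = preprocess(Image.open("cat.jpg")).unsqueeze(0)   # placeholder image

def zero_shot(vocab):
    texts = clip.tokenize([f"a photo of a {c}" for c in vocab])
    with torch.no_grad():
        img_f = model.encode_image(image)
        txt_f = model.encode_text(texts)
        img_f = img_f / img_f.norm(dim=-1, keepdim=True)
        txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
        probs = (100.0 * img_f @ txt_f.T).softmax(dim=-1)
    return vocab[probs.argmax().item()]

base = ["cat", "dog", "car"]
expanded = base + ["lynx", "tiger", "cougar"]   # competing concepts enter the vocabulary
print(zero_shot(base), zero_shot(expanded))     # unstable predictions indicate limited openness
```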
Realizing general-purpose language intelligence is a long-standing goal of natural language processing, for which standard evaluation benchmarks play a fundamental and guiding role. We argue that for general-purpose language intelligence evaluation, the benchmark itself needs to be comprehensive and systematic. To this end, we propose CUGE, a Chinese Language Understanding and Generation Evaluation benchmark with the following features: (1) a hierarchical benchmark framework, in which datasets are principally selected and organized along a language capability-task-dataset hierarchy; (2) a multi-level scoring strategy, in which model performance is reported at different levels based on the hierarchical framework. To facilitate CUGE, we provide a public leaderboard that can be customized to support flexible model-judging criteria. Evaluation results of representative pre-trained language models indicate ample room for improvement toward general-purpose language intelligence. CUGE is publicly available at cuge.baai.ac.cn.
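The sketch below illustrates the spirit of the capability-task-dataset hierarchy and multi-level scoring: dataset scores roll up into task scores, task scores into capability scores, and capability scores into an overall score. The simple averaging and all names and numbers are placeholders, not the official CUGE scoring formula.

```python
# Illustrative sketch of hierarchical, multi-level score aggregation.
hierarchy = {
    "language understanding": {
        "text classification": {"dataset_a": 71.3, "dataset_b": 65.0},   # placeholder scores
    },
    "language generation": {
        "summarization": {"dataset_c": 38.2},
    },
}

def average(values):
    return sum(values) / len(values)

capability_scores = {
    cap: average([average(ds.values()) for ds in tasks.values()])
    for cap, tasks in hierarchy.items()
}
overall = average(capability_scores.values())
print(capability_scores, overall)
```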
As many fine-tuned pre-trained language models (PLMs) with promising performance are generously released, it is crucial to study better ways of reusing these models, since reuse can greatly reduce the retraining computational cost and its potential environmental side effects. In this paper, we explore a novel model reuse paradigm, Knowledge Amalgamation (KA), for PLMs. Without human annotations, KA aims to merge the knowledge of different teachers, each specializing in a different classification problem, into a versatile student model. To achieve this, we design a Model Uncertainty-aware Knowledge Amalgamation (MUKA) framework, which uses Monte Carlo dropout to identify the potentially adequate teacher and estimate the gold supervision that guides the student. Experimental results demonstrate that MUKA achieves substantial improvements over baselines on benchmark datasets. Further analysis shows that MUKA generalizes well to complicated settings with multiple teacher models, heterogeneous teachers, and even cross-dataset teachers.
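A simplified reading of the uncertainty-aware selection step is sketched below: run each teacher several times with dropout kept active, score its predictive uncertainty, and let the least-uncertain teacher supply the pseudo-label for the student. The Hugging Face-style classifiers returning `.logits` and the entropy criterion are assumptions; this is not the exact MUKA objective.

```python
# Hedged sketch of Monte Carlo dropout teacher selection for knowledge amalgamation.
import torch
import torch.nn.functional as F

def mc_dropout_uncertainty(teacher, inputs, passes=8):
    teacher.train()  # keep dropout layers active at inference time
    with torch.no_grad():
        probs = torch.stack([F.softmax(teacher(**inputs).logits, dim=-1)
                             for _ in range(passes)])
    mean_p = probs.mean(dim=0)                                    # (batch, classes)
    entropy = -(mean_p * mean_p.clamp_min(1e-12).log()).sum(-1)   # predictive entropy
    return mean_p, entropy

def amalgamation_targets(teachers, inputs):
    means, ents = zip(*(mc_dropout_uncertainty(t, inputs) for t in teachers))
    best = torch.stack(ents).argmin(dim=0)           # least-uncertain teacher per example
    targets = torch.stack(means)[best, torch.arange(best.size(0))]
    return targets                                   # soft labels used to train the student
```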
The conventional wisdom in training deep classification models is to focus on badly classified examples and ignore well-classified examples that are far from the decision boundary. For instance, when training with the cross-entropy loss, examples with higher likelihoods (i.e., well-classified examples) contribute smaller gradients in back-propagation. However, we theoretically show that this common practice hinders representation learning, energy optimization, and margin growth. To counteract this deficiency, we propose to reward well-classified examples with additive bonuses to revive their contribution to learning. This counterexample theoretically addresses all three issues. We verify the theoretical results, either by direct validation or through experiments on diverse tasks, including image classification, graph classification, and machine translation. Furthermore, we show that, because our idea resolves these three issues, it can handle complex scenarios such as imbalanced classification, detection, and applications under adversarial attacks. Code is available at https://github.com/lancopku/well-classification-examples-are-underestimated.
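One plausible instantiation of the additive bonus is a mirrored log-probability term that keeps gradients alive for confident examples, sketched below. This is a simplified reading of the abstract (with clamping added for numerical safety), not necessarily the paper's exact loss; see the linked repository for the reference implementation.

```python
# Hedged sketch: cross entropy plus an additive bonus that rewards well-classified examples.
import torch
import torch.nn.functional as F

def bonus_cross_entropy(logits, targets, bonus_weight=1.0, eps=1e-6):
    log_p = F.log_softmax(logits, dim=-1)
    p = log_p.exp().gather(1, targets.unsqueeze(1)).squeeze(1)   # prob of the gold class
    ce = F.nll_loss(log_p, targets, reduction="none")            # standard cross entropy
    bonus = torch.log((1.0 - p).clamp_min(eps))                  # more negative as p -> 1
    return (ce + bonus_weight * bonus).mean()                    # bonus lowers the loss for
                                                                 # well-classified examples
```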
Video captioning combines video understanding and language generation. Unlike image captioning, which describes a static image in detail for nearly every object, video captioning usually considers a sequence of frames and is biased toward focused objects, e.g., the objects that remain in focus regardless of the changing background. Therefore, detecting and properly accommodating focused objects is critical in video captioning. To enforce the description of focused objects and achieve controllable video captioning, we propose an Object-Oriented Non-Autoregressive approach (O2NA), which performs caption generation in three steps: 1) identify the focused objects and predict their positions in the target caption; 2) generate the related attribute words and relation words of these focused objects to form a caption draft; 3) combine the video information to refine the caption draft into a fluent final caption. Since the focused objects are generated first, ahead of the other words, it is difficult to apply a word-by-word autoregressive generation process; instead, we adopt a non-autoregressive approach. Experiments on two benchmark datasets, MSR-VTT and MSVD, demonstrate the effectiveness of O2NA, which achieves results competitive with the state of the art but with higher diversity and faster inference speed.
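A toy sketch of the three-step, non-autoregressive flow is given below: the focused objects are placed at their predicted caption positions first, the remaining slots are drafted in parallel, and the draft is then refined. The lambda stand-ins replace the learned attribute/relation predictor and the video-conditioned refiner and are purely hypothetical.

```python
# Toy sketch of O2NA-style decoding order; the learned components are stubbed out.
def o2na_style_decode(length, object_slots, draft_fn, refine_fn):
    caption = [None] * length
    for pos, word in object_slots.items():          # step 1: focused objects go in first
        caption[pos] = word
    for i, w in enumerate(caption):                 # step 2: draft the other slots in parallel
        if w is None:
            caption[i] = draft_fn(i, caption)
    return refine_fn(caption)                       # step 3: refine the draft into fluent text

draft = lambda i, c: "<attr>"                       # hypothetical stand-in predictors
refine = lambda c: " ".join(c)
print(o2na_style_decode(6, {1: "dog", 4: "ball"}, draft, refine))
```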
In sequence-to-sequence learning, e.g., natural language generation, the decoder relies on the attention mechanism to efficiently extract information from the encoder. While the common practice is to draw information from only the last encoder layer, recent work has proposed using representations from different encoder layers for diversified levels of information. Nonetheless, the decoder still obtains only a single view of the source sequence, which may lead to insufficient training of the encoder layer stack due to the hierarchy bypassing problem. In this work, we propose layer-wise multi-view decoding, where, for each decoder layer, the representations from the last encoder layer serve as a global view and those from other encoder layers serve as a stereoscopic view of the source sequence. Systematic experiments and analyses show that we successfully address the hierarchy bypassing problem with an almost negligible increase in parameters, and substantially improve sequence-to-sequence learning with deep representations on five diverse tasks, i.e., machine translation, abstractive summarization, image captioning, video captioning, and medical report generation. In particular, our approach achieves new state-of-the-art results on eight benchmark datasets, including a low-resource machine translation dataset and two low-resource medical report generation datasets.
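Architecturally, the idea amounts to giving each decoder layer two cross-attention branches: one over the last encoder layer (the global view) and one over a layer-specific encoder representation (the stereoscopic view). The sketch below illustrates this; fusing the two branches by simple addition, the pre/post-norm layout, and the omission of the causal mask are simplifications made for brevity, not the paper's exact design.

```python
# Hedged sketch of a decoder layer with layer-wise multi-view cross-attention.
import torch
import torch.nn as nn

class MultiViewDecoderLayer(nn.Module):
    def __init__(self, d_model=512, nhead=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.global_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)  # last encoder layer
        self.stereo_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)  # layer-specific view
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, tgt, last_enc, layer_enc):
        # Causal masking of the self-attention is omitted for brevity.
        x = self.norms[0](tgt + self.self_attn(tgt, tgt, tgt, need_weights=False)[0])
        cross = (self.global_attn(x, last_enc, last_enc, need_weights=False)[0]
                 + self.stereo_attn(x, layer_enc, layer_enc, need_weights=False)[0])
        x = self.norms[1](x + cross)
        return self.norms[2](x + self.ffn(x))

# Usage: decoder layer i reads the last encoder layer plus encoder layer i.
layer = MultiViewDecoderLayer()
tgt = torch.randn(2, 7, 512)
enc_states = [torch.randn(2, 11, 512) for _ in range(6)]
out = layer(tgt, enc_states[-1], enc_states[2])
```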
Masked image modeling (MIM) performs strongly in pre-training large vision Transformers (ViTs). However, small models that are critical for real-world applications cannot, or can only marginally, benefit from this pre-training approach. In this paper, we explore distillation techniques to transfer the success of large MIM-based pre-trained models to smaller ones. We systematically study different options in the distillation framework, including distillation targets, losses, inputs, network regularization, sequential distillation, etc., revealing that: 1) distilling token relations is more effective than CLS-token- and feature-based distillation; 2) using an intermediate layer of the teacher network as the target performs better than using the last layer when the depth of the student does not match that of the teacher; 3) weak regularization is preferred; and so on. With these findings, we achieve significant fine-tuning accuracy improvements over scratch MIM pre-training on ImageNet-1K classification for ViT-Tiny, ViT-Small, and ViT-Base, with +4.2%/+2.4%/+1.4% gains, respectively. Our TinyMIM model of base size achieves 52.2 mIoU on ADE20K semantic segmentation, which is +4.1 higher than the MAE baseline. Our TinyMIM model of tiny size achieves 79.6% top-1 accuracy on ImageNet-1K image classification, which sets a new record for small vision models of the same size and computation budget. This strong performance suggests an alternative way of developing small vision Transformer models, namely, by exploring better training methods rather than introducing inductive biases into architectures as in most previous works. Code is available at https://github.com/OliverRensu/TinyMIM.
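The token-relation finding can be illustrated with a small distillation loss that matches softmax-normalized token-to-token similarity maps between an intermediate teacher layer and the student, rather than matching CLS tokens or raw features. The plain dot-product relation below is a simplification of the Q/K/V relations used in the paper; one convenient property it shares is that relation maps only require matching token counts, so teacher and student embedding widths may differ.

```python
# Hedged sketch of token-relation distillation between teacher and student tokens.
import torch
import torch.nn.functional as F

def relation_map(tokens, temperature=1.0):
    # tokens: (batch, num_tokens, dim) -> row-normalized log relation map
    sim = tokens @ tokens.transpose(1, 2) / tokens.size(-1) ** 0.5
    return F.log_softmax(sim / temperature, dim=-1)

def token_relation_kd_loss(student_tokens, teacher_tokens):
    s = relation_map(student_tokens)
    t = relation_map(teacher_tokens).exp()           # teacher relations as probability targets
    return F.kl_div(s, t, reduction="batchmean")
```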
This paper presents a practical global optimization algorithm for the K-center clustering problem, which aims to select K samples as the cluster centers so as to minimize the maximum within-cluster distance. The algorithm is based on a reduced-space branch-and-bound scheme and guarantees convergence to the global optimum in a finite number of steps by branching only on the regions of centers. To improve efficiency, we design a two-stage decomposable lower bound whose solution can be derived in closed form. In addition, we propose several acceleration techniques to narrow down the region of centers, including bounds tightening, sample reduction, and parallelization. Extensive studies on synthetic and real-world datasets demonstrate that our algorithm can solve K-center problems to global optimality within 4 hours for ten million samples in serial mode and one billion samples in parallel mode. Moreover, compared with state-of-the-art heuristic methods, the global optimum obtained by our algorithm reduces the objective function by 25.8% on average across all the synthetic and real-world datasets.
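For reference, the sketch below spells out the K-center objective (the maximum distance from any sample to its nearest center) together with the classic greedy farthest-first heuristic, a 2-approximation representative of the heuristic baselines such global methods are compared against. The paper's reduced-space branch-and-bound algorithm and its decomposable lower bound are not reproduced here.

```python
# K-center objective and the greedy farthest-first heuristic (2-approximation).
import numpy as np

def kcenter_objective(X, centers):
    d = np.linalg.norm(X[:, None, :] - X[centers][None, :, :], axis=-1)
    return d.min(axis=1).max()        # maximum distance to the closest center

def greedy_kcenter(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = [int(rng.integers(len(X)))]
    dist = np.linalg.norm(X - X[centers[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(dist.argmax())                      # farthest point becomes the next center
        centers.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    return centers

X = np.random.rand(1000, 2)
C = greedy_kcenter(X, k=10)
print(kcenter_objective(X, C))
```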